Towards High Speed Grammar Induction on Large Text Corpora
Abstract
In this paper we describe an efficient and scalable implementation for grammar induction based on the EMILE approach ([2], [3], [4], [5], [6]). The current EMILE 4.1 implementation ([11]) is one of the first efficient grammar induction algorithms that work on free text. Although EMILE 4.1 is far from perfect, it enables researchers to do empirical grammar induction research on various types of corpora. The EMILE approach is based on notions from categorial grammar (cf. [10]), which is known to generate the class of context-free languages. EMILE learns from positive examples only (cf. [1], [7], [9]). We describe the algorithms underlying the approach and some interesting practical results on small and large text collections. As shown in the articles mentioned above, in the limit EMILE learns the correct grammatical structure of a language from sentences of that language. The conducted experiments show that, put into practice, EMILE 4.1 is efficient and scalable. The current implementation learns a subclass of the shallow context-free languages. This subclass seems sufficiently rich to be of practical interest. In particular, EMILE seems to be a valuable tool in the context of syntactic and semantic analysis of large text corpora.
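As a rough illustration of learning from positive examples only, the sketch below splits each sentence into context/expression pairs and groups expressions that occur in the same context. It is a minimal sketch under stated assumptions, not the EMILE 4.1 implementation: the function names, the toy corpus, and the `min_size` threshold are all hypothetical, and the clustering and rule-induction steps of the actual approach are not reproduced here.

```python
from collections import defaultdict


def context_expression_pairs(sentence):
    """Yield ((left, right), expression) for every non-empty contiguous span.

    Hypothetical helper: the context is the pair of word tuples to the left
    and right of the span; the expression is the span itself.
    """
    words = sentence.split()
    n = len(words)
    for i in range(n):
        for j in range(i + 1, n + 1):
            left = tuple(words[:i])
            right = tuple(words[j:])
            yield (left, right), tuple(words[i:j])


def build_context_table(sentences):
    """Map each context to the set of expressions observed in it."""
    table = defaultdict(set)
    for sentence in sentences:
        for context, expression in context_expression_pairs(sentence):
            table[context].add(expression)
    return table


def substitutable_groups(table, min_size=2):
    """Keep contexts shared by at least min_size expressions.

    Such groups are only candidates for a common grammatical type
    (assumed threshold; no statistical support check is done here).
    """
    return {ctx: exprs for ctx, exprs in table.items() if len(exprs) >= min_size}


# Toy corpus of positive examples only (assumed for illustration).
corpus = ["John likes tea", "John likes coffee", "Mary likes tea"]
table = build_context_table(corpus)
groups = substitutable_groups(table)
```

On this toy corpus, "tea" and "coffee" end up grouped under the context "John likes _", and "John" and "Mary" under "_ likes tea". Enumerating every span is quadratic in sentence length, so a scalable implementation would need a more economical representation than this one.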
Similar resources
Learning Classifier System Approach to Natural Language Grammar Induction
This paper describes an evolutionary approach to the problem of inferring non-stochastic context-free grammar (CFG) from natural language (NL) corpora. The approach employs Grammar-based Classifier System (GCS). GCS is a new version of Learning Classifier Systems in which classifiers are represented by CFG in Chomsky Normal Form. GCS has been tested on the NL corpora, and it provided comparable...
Tiny Corpus Applications with Transformation-Based Error-Driven Learning: Evaluations of Automatic Grammar Induction and Partial Parsing of SaiSiyat
This paper reports a preliminary result on automatic grammar induction based on the framework of Brill and Markus (1992) and binary-branching syntactic parsing of Esperanto and SaiSiyat (a Formosan language). Automatic grammar induction requires large corpus and is found implausible to process endangered minor languages. Syntactic parsing, on the contrary, needs merely tiny corpus and works alo...
Semi-automatic acquisition of domain-specific semantic structures
This paper describes a methodology for semi-automatic grammar induction from unannotated corpora belonging to a restricted domain. The grammar contains both semantic and syntactic structures, which are conducive towards language understanding. Our work aims to ameliorate the reliance of grammar development on expert handcrafting or the availability of annotated corpora. To strive for a reasonab...
A Systematic Comparison between Inversion Transduction Grammar and Linear Transduction Grammar for Word Alignment
We present two contributions to grammar driven translation. First, since both Inversion Transduction Grammar and Linear Inversion Transduction Grammars have been shown to produce better alignments than the standard word alignment tool, we investigate how the trade-off between speed and end-to-end translation quality extends to the choice of grammar formalism. Second, we prove that Linear Transd...
Evolutionary Computing as a Tool for Grammar Development
In this paper, an agent-based evolutionary computing technique is introduced that is geared towards the automatic induction and optimization of grammars for natural language (grael). We outline three instantiations of the grael-environment: the grael-1 system uses large annotated corpora to bootstrap grammatical structure in a society of autonomous agents, that tries to optimally redistribute g...
Publication date: 2000